Short-Term Memory in Orthogonal Neural Networks

We study the ability of linear recurrent networks obeying discrete time dynamics to store long
temporal sequences that are retrievable from the instantaneous state of the network. We calculate
this temporal memory capacity for both distributed shift register and random orthogonal
connectivity matrices. We show that the memory capacity of these networks scales with system size.

The brain holds information in short-term memory for
use in prospective action. It is thought that persistent
firing patterns in cortical networks subserve such working
memory, and several mechanisms have been proposed.
In some, a stimulus, such as a spoken word or a picture,
activates a pattern of neuronal activity that persists
for several seconds because it corresponds to an isolated
stable fixed point of the dynamics [2]. However, working
memory often involves memorization of graded signals,
such as stimulus location in 2D space or the eye's
gaze, which are hard to associate with discrete attractors.
Recent models propose that short-term memory is
associated with low dimensional continuous manifolds of
attractors. Both network [3] and single cell mechanisms
have been implicated in generating these manifolds.

More recently it was suggested that a generic recur- 
rent network can store arbitrary temporal inputs in its 
transient responses, even though these responses do not 
correspond to attractors, and that working memory op- 
erates by reconstructing input history from the network's 
current state. This proposal has been investigated
numerically for some recurrent networks but its theoret- 
ical underpinnings are unexplored. For instance, how 
does a network's storage capacity for temporal memory 
scale with system size? What network architectures are 
suitable? How do noise and structural perturbations af- 
fect memory? Such issues have important implications 
for general dynamical systems. To what extent can the 
history of perturbations on a dissipative system be recon- 
structed from its current state? How does the transient 
memory of a dynamical system depend on its number of 
degrees of freedom and the amount of noise? In this pa- 
per we develop theoretical understanding of the capacity 
of linear recurrent networks to store temporal signals. 

Model: We consider here the discrete time model proposed
by Jaeger. A time dependent scalar signal, s(n),
is memorized by a linear recurrent network with N neu- 
rons obeying the discrete time dynamics: 

$$x(n) = W x(n-1) + v\,s(n) + z(n). \qquad (1)$$

$x(n)$ gives the network state at time $n$, $v$ is a unit norm constant vector of connections
from the input source, $W$ is the matrix of recurrent connections, which need not be symmetric,
and $z(n)$ is a noise vector. To ensure dynamical stability, we require $\alpha < 1$, where
$\alpha$ is the norm squared of $W$'s largest eigenvalue. The goal is to extract the scalar
history of the signal, $\{s(m) \mid m \le n\}$, from the current network state $x(n)$. This is
achieved using a layer of linear readout neurons, the state of which at time $n$ is given by
$\{y_k(n) = u_k^T x(n),\ k = 0, 1, 2, \ldots\}$, see Fig. 1. $u_k$ is a constant vector of output
connections from the recurrent network to the $k$-th readout neuron. It is chosen to minimize
the mean square deviation $\langle |y_k(n) - s(n-k)|^2 \rangle_n$ so that $y_k(n)$ is close to
$s(n-k)$, where $\langle \ldots \rangle_n$ denotes a time average. The resultant optimal output
weight is $u_k = C^{-1} p_k$, where $C = \langle x(n) x^T(n) \rangle_n$ is the covariance matrix
of $x$, and $p_k = \langle s(n-k)\, x(n) \rangle_n$. The ability to embed signals in the network
may depend on their statistics. Here we characterize the signal ensemble by
$\langle s(n) \rangle_n = 0$ and $\langle s(n) s(n+k) \rangle_n = \delta_{k,0}$. The noise vectors
have zero mean, $\langle z(n) \rangle_n = 0$, and variance
$\langle z_i(n) z_j(n+k) \rangle_n = \epsilon\, \delta_{k,0}\, \delta_{i,j}$. With the above signal
and noise statistics, $p_k = W^k v$, and

$$C = \sum_{k=0}^{\infty} W^k v v^T W^{kT} + \epsilon\, C_n, \qquad (2)$$

where the scaled noise covariance is $C_n = \sum_{k=0}^{\infty} W^k W^{kT}$.

FIG. 1: Architecture of the short-term memory network. A
single input unit, s, feeds into an N unit recurrent network.
Each of an array of readout units reconstructs the input at a
given past time.
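The expressions for $u_k$, $p_k$ and $C$ quoted above follow from unrolling Eq. (1). The short derivation below is a sketch under two assumptions that the text implies but does not spell out: the network has run long enough for its initial condition to be forgotten (guaranteed by $\alpha < 1$), and signal and noise are uncorrelated with each other.

```latex
\begin{align}
x(n) &= \sum_{j=0}^{\infty} W^{j}\,\bigl[\,v\,s(n-j) + z(n-j)\,\bigr], \\
p_k &= \langle s(n-k)\,x(n) \rangle_n
     = \sum_{j=0}^{\infty} W^{j} v\,\delta_{j,k} = W^{k} v, \\
C &= \langle x(n)\,x^{T}(n) \rangle_n
   = \sum_{j=0}^{\infty} W^{j} v v^{T} W^{jT}
     + \epsilon \sum_{j=0}^{\infty} W^{j} W^{jT}, \\
\langle |u^{T} x(n) - s(n-k)|^{2} \rangle_n
  &= u^{T} C u - 2\,u^{T} p_k + 1
  \quad\Longrightarrow\quad u_k = C^{-1} p_k .
\end{align}
```

The last line is the usual least-squares condition: setting the gradient with respect to $u$ to zero gives $C u = p_k$.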

Memory function: We define the system's memory function as the overlap between the past input
and its reconstructed value, $m(k) = \langle s(n-k)\, y_k(n) \rangle_n$. With the above statistics,

$$m(k) = p_k^T C^{-1} p_k. \qquad (3)$$

$m(k) = 1$ corresponds to perfect reconstruction, $y_k(n) = s(n-k)$, whereas $m(k) = 0$ indicates
no memory of $s(n-k)$. For an arbitrary, stable connection matrix $W$,

$$\sum_{k=0}^{\infty} m(k) = \mathrm{Tr}\, C^{-1} \sum_{k=0}^{\infty} p_k p_k^T = N - \epsilon\, \mathrm{Tr}\, C^{-1} C_n. \qquad (4)$$



This sum rule provides a useful indication of the network's short-term memory. For zero noise,
the area under the memory function is exactly $N$, implying that all $N$ degrees of freedom are
useful for storage. Storage capacity decreases with the strength of the noise. System performance
also relates to the shape of $m(k)$. Since $0 \le m(k) \le 1$, it follows that the length of
signal that can be exactly reproduced is also bounded: at most $N$ past inputs can have
$m(k) = 1$. To characterize the length of time over which a signal can be retrieved with
reasonable accuracy, we define the temporal capacity $k_c$ as the minimum value of $k$ such that
$m(k) < \frac{1}{2}$. We focus particularly on the conditions under which the system's capacity
is extensive, namely $k_c \propto N$ as $N \to \infty$. For given noise $\epsilon$, we define
$\alpha_{\rm opt}$ as the value of $\alpha$ at which capacity achieves its maximum,
$k_c^{\rm opt}$.
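The sum rule of Eq. (4) takes only a few lines to verify. The sketch below uses nothing beyond Eqs. (2) and (3) and the cyclic property of the trace.

```latex
\begin{align}
\sum_{k=0}^{\infty} m(k)
  &= \sum_{k=0}^{\infty} p_k^{T} C^{-1} p_k
   = \mathrm{Tr}\Bigl[\, C^{-1} \sum_{k=0}^{\infty} p_k p_k^{T} \Bigr] \\
  &= \mathrm{Tr}\bigl[\, C^{-1} \,( C - \epsilon\, C_n ) \bigr]
   = N - \epsilon\, \mathrm{Tr}\, C^{-1} C_n .
\end{align}
```

The second line uses $\sum_k p_k p_k^T = \sum_k W^k v v^T W^{kT} = C - \epsilon\, C_n$, which is Eq. (2) rearranged.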

Distributed Shift Register (DSR) Network: A straightforward candidate for a short-term memory
system is a delay line, or a shift register network, which corresponds to a one-dimensional
network with $W_{ij} = \sqrt{\alpha}\, \delta_{i,j+1}$ and $v_i = \delta_{i,1}$. One drawback of
this system is its extreme sensitivity to removal of a single neuron. A more robust, distributed
architecture of the shift register operation is a fully connected network with

$$W = \sqrt{\alpha} \sum_{k=1}^{N-1} v^{(k+1)} v^{(k)T}, \qquad v = v^{(1)}, \qquad (5)$$

where $\{v^{(k)}\}$ is an arbitrary set of $N$ orthonormal vectors. Note that
$W v^{(k)} = \sqrt{\alpha}\, v^{(k+1)}$ for $k \le N-1$ and $W v^{(N)} = 0$, implying that
$W^N = 0$. In this network the covariance matrix is

$$C = \sum_{k=1}^{N} \left[ \alpha^{k-1} + \tilde{\epsilon}\,(1 - \alpha^{k}) \right] v^{(k)} v^{(k)T}, \qquad (6)$$

where $\tilde{\epsilon} = \epsilon/(1-\alpha)$. The memory function is given by

$$m(k) = \frac{\alpha^{k}}{\alpha^{k} + \tilde{\epsilon}\,(1 - \alpha^{k+1})}, \qquad k = 0, \ldots, N-1, \qquad (7)$$

and $m(k) = 0$ for $k \ge N$. In zero noise,
$x(n) = \sum_{k=0}^{N-1} \alpha^{k/2}\, s(n-k)\, v^{(k+1)}$. Thus the network embeds each of the
previous $N$ signal values in a distinct orthogonal direction $v^{(k+1)}$, for $k$ up to $N-1$.
An important question in both this and the following models is how the value of $\alpha$ affects
the system performance. In the absence of noise the present model retrieves perfectly the most
recent $N$ inputs for all values of $\alpha$, as implied by Eq. (7). However, the required
readout weights $u_k = \alpha^{-k/2} v^{(k+1)}$ for the retrieval of these memories increase with
decreasing $\alpha$, limiting the choice of $\alpha$ to values close to 1.

FIG. 2: Memory capacities of DSR and Orthogonal networks for $\epsilon = 10^{-4}$ and $N = 400$
at various values of $\alpha$. For the Orthogonal network, the circles show simulation results
and the solid lines show predictions of the Annealed Approximation, which begins to break down
for $\alpha$ very near 1. While the memory of the DSR improves with increasing $\alpha$, the
capacity of the orthogonal network peaks at $\alpha_{\rm opt} = 0.98$.
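The zero-noise behaviour described above can be checked with a minimal sketch of the plain delay line, which is the special case $v^{(k)} = e_k$ (the $k$-th unit vector) of Eq. (5). The function names, the network size and the value of $\alpha$ below are illustrative assumptions, not taken from the paper.

```python
import math
import random


def simulate_delay_line(signal, n_units, alpha):
    """Iterate x(n) = W x(n-1) + v s(n) with W_ij = sqrt(alpha) * delta_{i,j+1},
    v = e_1 and no noise; returns the final state x(n)."""
    root = math.sqrt(alpha)
    x = [0.0] * n_units
    for s in signal:
        # Shift every component one unit down the line, shrinking it by
        # sqrt(alpha), then write the new input into the first unit.
        x = [s] + [root * xi for xi in x[:-1]]
    return x


def read_out(x, k, alpha):
    """Zero-noise readout u_k = alpha^(-k/2) e_{k+1}: estimate of s(n-k)."""
    return alpha ** (-k / 2.0) * x[k]


rng = random.Random(0)
N, ALPHA = 8, 0.9  # illustrative values
signal = [rng.gauss(0.0, 1.0) for _ in range(50)]
state = simulate_delay_line(signal, N, ALPHA)

# Reconstruction error for each delay k = 0 .. N-1 (signal[-1 - k] is s(n-k)).
errors = [abs(read_out(state, k, ALPHA) - signal[-1 - k]) for k in range(N)]
# Norm of the readout weight for each delay: grows as alpha^(-k/2).
weight_norms = [ALPHA ** (-k / 2.0) for k in range(N)]
print(max(errors), weight_norms[-1])
```

The reconstruction is exact up to rounding, while the readout weight norm grows with the delay $k$. This growth is the practical cost of choosing small $\alpha$.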

Non-zero noise contributes to $x(n)$, polluting the signal but leaving fixed the directions
along which temporal signals are embedded. When $\epsilon > 0$, $m(k) < 1$ for all $k$. The
capacity $k_c$ is greater than 0 for all $\epsilon < 1$. It increases with decreasing $\epsilon$
and saturates to $k_c = N$ at zero noise. For fixed noise, increasing $\alpha$ increases the
signal-to-noise ratio and hence $m(k)$ increases, see Fig. 2. Thus $\alpha_{\rm opt} \to 1$ for
all values of noise $\epsilon > 0$.
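A small numerical sketch of the noisy DSR, using only the closed forms of Eqs. (6) and (7): it evaluates the capacity $k_c$ for several values of $\alpha$ and checks the sum rule of Eq. (4). It relies on the fact that for the DSR the matrices $C$ and $C_n$ share the eigenvectors $v^{(k)}$, with $C_n$ eigenvalues $(1-\alpha^k)/(1-\alpha)$. Function names and the grid of $\alpha$ values are illustrative choices.

```python
def dsr_memory(k, n_units, alpha, eps):
    """Eq. (7): memory function of the distributed shift register."""
    if k >= n_units:
        return 0.0
    eps_t = eps / (1.0 - alpha)  # scaled noise, epsilon-tilde
    return alpha ** k / (alpha ** k + eps_t * (1.0 - alpha ** (k + 1)))


def dsr_capacity(n_units, alpha, eps):
    """Smallest k with m(k) < 1/2; equals N if m(k) >= 1/2 for all k < N."""
    for k in range(n_units):
        if dsr_memory(k, n_units, alpha, eps) < 0.5:
            return k
    return n_units


def sum_rule_sides(n_units, alpha, eps):
    """Both sides of Eq. (4), using the eigenvalues of C (Eq. 6) and of C_n."""
    eps_t = eps / (1.0 - alpha)
    lhs = sum(dsr_memory(k, n_units, alpha, eps) for k in range(n_units))
    trace = 0.0
    for j in range(1, n_units + 1):
        c_j = alpha ** (j - 1) + eps_t * (1.0 - alpha ** j)  # eigenvalue of C
        cn_j = (1.0 - alpha ** j) / (1.0 - alpha)            # eigenvalue of C_n
        trace += cn_j / c_j
    return lhs, n_units - eps * trace


N, EPS = 400, 1e-4  # the values quoted in the caption of Fig. 2
alphas = [0.90, 0.95, 0.98, 0.99, 0.999]
capacities = [dsr_capacity(N, a, EPS) for a in alphas]
lhs, rhs = sum_rule_sides(N, 0.98, EPS)
print(list(zip(alphas, capacities)), lhs, rhs)
```

For each fixed $k$, $m(k)$ in Eq. (7) increases with $\alpha$, so the computed capacities never decrease along the grid and saturate at $N$.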

Random Orthogonal Network: We next ask whether broader classes of connection matrices can also
store long temporal signals. A plausible extension of the above model is to a network with
$W = \sqrt{\alpha}\, O$, where $O$ is an $N \times N$ orthogonal matrix (i.e., $O O^T = I$) and
$\alpha < 1$. Similar to the DSR model, $W$ performs a rotation followed by a shrinking with
factor $\sqrt{\alpha}$. However, $W^k v$ and $W^{k+1} v$ are not necessarily orthogonal, in
contrast to the DSR. Moreover, while $W^N = 0$ for the DSR, orthogonal $W$ is full rank.
Consequently, inputs from more than $N$ time steps in the past can interfere with current inputs.
For any choice of $O$ and $v$ the covariance matrix is

$$C = \sum_{k=0}^{\infty} \alpha^{k}\, O^{k} v v^{T} O^{kT} + \tilde{\epsilon}\, I, \qquad \tilde{\epsilon} = \epsilon/(1-\alpha). \qquad (8)$$
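The noise term of Eq. (8) is isotropic because orthogonality makes every term of the noise covariance proportional to the identity. A one-step check, using only $O O^T = I$ and the definition of $C_n$ given with Eq. (2):

```latex
\begin{align}
W^{k} W^{kT} &= \alpha^{k}\, O^{k} O^{kT} = \alpha^{k}\, I, \\
C_n &= \sum_{k=0}^{\infty} W^{k} W^{kT}
     = \sum_{k=0}^{\infty} \alpha^{k}\, I
     = \frac{1}{1-\alpha}\, I, \\
\epsilon\, C_n &= \tilde{\epsilon}\, I .
\end{align}
```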

However, the system behavior can depend upon the particular $O$ and $v$. We therefore consider
$O$ drawn from the Gaussian Orthogonal Ensemble (GOE) and input connections $v$ from a Gaussian
distribution with $v^T v = 1$. We evaluate $m(k) = \langle p_k^T C^{-1} p_k \rangle$, where the
average is over these ensembles and captures the typical behavior for large $N$. Exact
analytical evaluation of $m(k)$ is complex and requires accounting for statistical correlations
between powers of random orthogonal matrices. In the following we solve the problem under the
"Annealed Approximation" (AA), in which $W$ and $v$ are not quenched in time but drawn randomly
at each time step. Under this approximation,



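$$m(k) = \frac{\alpha^{k} q}{1 + \alpha^{k} q}, \qquad (9)$$

where $q$ satisfies:

$$\sum_{k=0}^{\infty} \frac{\alpha^{k} q}{1 + \alpha^{k} q} = N\,(1 - \tilde{\epsilon}\, q). \qquad (10)$$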



To see this, first note that in the annealed scenario $p_k = \alpha^{k/2} v^{(k)}$, where
$\{v^{(k)}\}$ is an infinite set of independent random normalized vectors. Hence,

$$m(k) = \left\langle \alpha^{k}\, v^{(k)T} \left[ C_0 + \alpha^{k} v^{(k)} v^{(k)T} \right]^{-1} v^{(k)} \right\rangle, \qquad (11)$$

where $C_0 = C - \alpha^{k} v^{(k)} v^{(k)T}$ is independent of the random variable $v^{(k)}$.
Expanding in powers of $\alpha^{k}$ and averaging yields Eq. (9) for the memory function, with

$$q = \left\langle v^{(k)T} C_0^{-1} v^{(k)} \right\rangle = \frac{1}{N} \left\langle \mathrm{Tr}\, C_0^{-1} \right\rangle = \frac{1}{N} \left\langle \mathrm{Tr}\, C^{-1} \right\rangle. \qquad (12)$$

The last equality holds for large $N$ because
$\langle \mathrm{Tr}\, C^{-1} \rangle - \langle \mathrm{Tr}\, C_0^{-1} \rangle \sim \alpha^{k} \langle v^{(k)T} C^{-2} v^{(k)} \rangle$,
which is only of order 1 due to the normalization of $v^{(k)}$. Eq. (10) for $q$ is obtained
from the sum rule Eq. (4) by substituting $C_n = (1-\alpha)^{-1} I$. Though the annealed
approximation neglects quenched correlations, it agrees surprisingly well with the numerical
solution of the quenched system for all $\alpha$ values except for $\alpha \to 1$, as seen in
the examples of Fig. 2 for $\epsilon = 10^{-4}$.
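Equations (9) and (10) are easy to solve numerically. The sketch below finds $q$ by bisection and then reads the capacity off Eq. (9). It relies on the left side of Eq. (10) increasing with $q$ while the right side decreases, so the root is unique and lies between 0 and $1/\tilde{\epsilon}$. The function names, the truncation tolerance and the grid of $\alpha$ values are assumptions made for illustration. The numbers it produces are annealed predictions only, not simulations of the quenched network.

```python
def aa_sum(q, alpha, tol=1e-16):
    """Left side of Eq. (10), truncated once the terms alpha^k q fall below tol."""
    total, weight = 0.0, 1.0  # weight holds alpha^k
    while weight * q > tol:
        total += weight * q / (1.0 + weight * q)
        weight *= alpha
    return total


def solve_q(n_units, alpha, eps, iters=100):
    """Bisection for the root q of Eq. (10), which lies in (0, 1/eps_tilde)."""
    eps_t = eps / (1.0 - alpha)
    lo, hi = 0.0, 1.0 / eps_t
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        residual = aa_sum(mid, alpha) - n_units * (1.0 - eps_t * mid)
        if residual > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def aa_memory(k, q, alpha):
    """Eq. (9): annealed memory function."""
    return alpha ** k * q / (1.0 + alpha ** k * q)


def aa_capacity(q, alpha):
    """Smallest k with m(k) < 1/2, i.e. with alpha^k q < 1."""
    k, weight = 0, 1.0
    while weight * q >= 1.0:
        weight *= alpha
        k += 1
    return k


N, EPS = 400, 1e-4  # the values quoted in the caption of Fig. 2
results = {}
for a in (0.90, 0.95, 0.98):
    q = solve_q(N, a, EPS)
    results[a] = (q, aa_capacity(q, a))
print(results)
```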

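To analyze the network's behavior, we first consider the limit $N \to \infty$ with
$(1-\alpha)N \to \infty$ and $\epsilon N \to \infty$. In this case, the sum of $m(k)$ is finite,
so Eq. (10) gives $q = 1/\tilde{\epsilon}$, and thus Eq. (9) reduces to

$$m(k) = \frac{\alpha^{k}}{\alpha^{k} + \tilde{\epsilon}}, \qquad k = 0, 1, 2, \ldots \qquad (13)$$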




FIG. 3: Capacity per neuron as a function of $\rho = N(1-\alpha)$, at scaled noise
$\hat{\epsilon} = N\epsilon = 0.04$, showing non-monotonic dependence. Points from simulations
on differently sized systems fall on essentially the same curve, confirming that in this regime
capacity is extensive. The inset shows the dependence of $k_c/N$ on the scaled noise
$\hat{\epsilon}$ for $\rho_{\rm opt}$, both in the AA and from simulation.



Capacity is nonzero for $\tilde{\epsilon} < 1$, in which case
$k_c = \log(\tilde{\epsilon})/\log(\alpha)$. Here $\alpha_{\rm opt}(\epsilon)$ is less than
unity and decreases with increasing noise because it results from a balance between signal
suppression on one hand and amplification of error input $\epsilon$ by the factor
$(1-\alpha)^{-1}$ on the other. For small $\epsilon$, maximizing $k_c$ with respect to $\alpha$
yields $\alpha_{\rm opt} \approx 1 - e\epsilon$ and $k_c(\alpha_{\rm opt}) \approx 1/(e\epsilon)$,
where $e$ is the base of the natural logarithm. For $\epsilon \to 1$, $\alpha_{\rm opt} \to 0$
and $k_c \to 0$. Equation (13) is exact in the limit of large $N$ if $\epsilon$ and $\alpha$
are kept fixed.
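The small-noise optimum quoted above can be checked numerically. The sketch below treats $k_c = \log(\tilde{\epsilon})/\log(\alpha)$ as a continuous function of $\delta = 1 - \alpha$ and maximizes it with a ternary search on $\log \delta$. The search method, the function names and the noise value are illustrative assumptions. The function has a single maximum on $(\epsilon, 1)$, which is what the search relies on.

```python
import math


def finite_capacity(delta, eps):
    """k_c from Eq. (13) as a continuous function of delta = 1 - alpha."""
    eps_t = eps / delta  # epsilon-tilde
    if eps_t >= 1.0:
        return 0.0       # m(0) <= 1/2, so no capacity
    return math.log(eps_t) / math.log1p(-delta)


def best_delta(eps, iters=200):
    """Ternary search for the maximizing delta on a logarithmic scale."""
    lo, hi = math.log(eps), math.log(0.999)
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if finite_capacity(math.exp(a), eps) < finite_capacity(math.exp(b), eps):
            lo = a
        else:
            hi = b
    return math.exp(0.5 * (lo + hi))


EPS = 1e-3  # illustrative noise level
delta_opt = best_delta(EPS)
k_opt = finite_capacity(delta_opt, EPS)
# Compare with the small-noise predictions 1 - alpha_opt = e*eps and k_c = 1/(e*eps).
print(delta_opt, math.e * EPS, k_opt, 1.0 / (math.e * EPS))
```

The numerical optimum agrees with $1 - \alpha_{\rm opt} \approx e\epsilon$ and $k_c \approx 1/(e\epsilon)$ up to corrections of relative order $\epsilon$.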

In order to yield extensive capacity, $\epsilon$ and $1-\alpha$ must decrease inversely with
$N$, so that $1 - \alpha = \rho/N$ and $\epsilon = \hat{\epsilon}/N$ with $\rho$ and
$\hat{\epsilon}$ finite. To see that in this case $k_c$ scales with system size $N$, we write
$q = \exp(\rho\mu)$. Capacity $k_c = \mu N$ is extensive for $\mu = O(1)$ and $k_c = 0$ for
$\mu < 0$, see Eq. (9). The sum rule Eq. (10) determines the value of $\mu$. In the present
regime the sum can be approximated by an integral, yielding

$$\rho = \log\bigl(1 + \exp(\rho\mu)\bigr) + \hat{\epsilon}\, \exp(\rho\mu). \qquad (14)$$

Solving for $\mu$ yields a nonmonotonic function of $\rho$ (see Fig. 3), which attains its
maximum at $\rho_{\rm opt}$. For $\rho < \rho_{\rm opt}$, $\mu$ decreases with decreasing $\rho$,
and $\mu < 0$ for $\rho$ smaller than the critical value $\rho_c = \log(2) + \hat{\epsilon}$.
For large noise, i.e., $\hat{\epsilon} \gg 1$, $\rho_{\rm opt}$ increases with noise level as
$\rho_{\rm opt} \sim e\hat{\epsilon}$, as in the finite capacity limit. However, at low noise
levels $\rho_{\rm opt}$ does not approach 0 (as predicted by the finite capacity limit) but
increases with decreasing $\hat{\epsilon}$ as $\rho_{\rm opt} \propto \log(1/\hat{\epsilon})$.
This is because for $\epsilon$ sufficiently small and $\alpha$ sufficiently close to 1, strong
long-time interference prevents faithful reconstruction even of recent inputs. Therefore,
$\alpha_{\rm opt} < 1$ even in the $\epsilon \to 0$ limit. Additionally, choosing $\rho$ close
to $\rho_{\rm opt}$ reduces retrieval error for values of $k < k_c$. This behavior is
demonstrated in Fig. 3. As in the DSR model, the choice of $\rho$ should be bounded not only by
capacity limits but also by the magnitudes of the output weights, which increase with $\rho$
roughly as $\|u_k\| \sim \alpha^{-k/2} \approx \exp(\rho k/2N)$ at zero noise.
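Equation (14) can be solved for the capacity per neuron $\mu$ by bisection on $q = \exp(\rho\mu)$, since its right-hand side increases monotonically with $q$. The sketch below does this within the annealed approximation only. The name `eps_hat` stands for the scaled noise $N\epsilon$, and the function name, the grid of $\rho$ values and the noise level are illustrative assumptions.

```python
import math


def solve_mu(rho, eps_hat, iters=200):
    """Solve Eq. (14), rho = log(1 + q) + eps_hat * q with q = exp(rho * mu),
    for mu by bisection on q. Requires rho > 0."""
    # For eps_hat = 0 the root is q = exp(rho) - 1; noise can only lower it.
    lo, hi = 0.0, math.expm1(rho)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if math.log1p(mid) + eps_hat * mid > rho:
            hi = mid
        else:
            lo = mid
    q = 0.5 * (lo + hi)
    return math.log(q) / rho


EPS_HAT = 0.04  # illustrative scaled noise, N * epsilon

# mu vanishes at the critical value rho_c = log(2) + eps_hat, where q = 1.
rho_c = math.log(2.0) + EPS_HAT
mu_c = solve_mu(rho_c, EPS_HAT)

# Scan rho on a grid to locate the maximum of the capacity per neuron.
rho_grid = [0.1 * i for i in range(8, 101)]
mu_grid = [solve_mu(rho, EPS_HAT) for rho in rho_grid]
rho_opt = rho_grid[mu_grid.index(max(mu_grid))]
print(mu_c, rho_opt, max(mu_grid))
```

The scan reproduces the non-monotonic shape described in the text: $\mu$ rises from zero at $\rho_c$, peaks at an intermediate $\rho_{\rm opt}$, and then falls as suppression by $\alpha$ takes over.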

The Annealed Approximation: An interesting issue is the range of validity of the annealed
approximation. As indicated above, finite capacity results should be exact in the large $N$
limit with fixed noise and suppression coefficient. In this limit, at any given time only a
small number of directions in $x$ space contribute to the current state and, since $O$ is
random, the correlations among them are negligible. On the other hand, when $\epsilon$ and
$1-\alpha$ are proportional to $1/N$, the number of unsuppressed modes is of order $N$ and hence
the combined effect of their correlations becomes important. Breakdown of the AA occurs when
memory of early times begins to decrease due to strong long-time interference. Our simulations
indicate that this occurs for $\rho = N(1-\alpha) < 10$, as seen for small $\rho$ in
Figs. 2 and 3. Derivation of a full quenched theory requires appropriate handling of the
intricate correlations among high powers of random orthogonal matrices.

Robustness: Our results show that systems with the 
DSR or orthogonal architectures are tolerant to stochastic
noise in their network dynamics up to noise amplitudes
significantly larger than 1/N. An important issue
is the sensitivity of the network to structural noise. We 
have tested numerically the robustness of the orthogo- 
nal recurrent network to neuron deletion. We find that 
this perturbation does not affect drastically the capac- 
ity of the system provided that the output weights are 
retrained after the neurons' removal, in contrast to the 
simple delay line. If the output weights are not retrained, 
however, capacity drops substantially. Therefore for this 
to be a viable model of working-memory, the system 
would need to relearn weights sufficiently quickly upon 
neuron loss. Note also that these results imply that
W need not be exactly orthogonal since neuron removal 
is also a perturbation away from orthogonality. 

Random Gaussian Matrices: In this work we have assumed rather special network architectures.
On the other hand, there are claims that any generic (stable) connection matrix $W$ can
robustly store long temporal signals; if substantiated this is indeed a powerful result. The
theoretical study of more generic ensembles is difficult. However, our simulations of fully
connected Gaussian random matrices indicate that their capacity is not extensive. If
$\epsilon = 0$ and $\alpha$ is sufficiently small then it is likely that $m(k) = 1$ for $k < N$
and zero for larger $k$, as is the case in the models studied here. This is because for small
$\alpha$, interference from long past times is negligible and hence the sum rule Eq. (4) implies
the above square form for $m(k)$. However, this capacity is unusable in large systems because
it requires exponentially large output weights (reflecting the near singularity of the
correlation matrix $C$). Taking $\alpha$ close to 1, so that $C$ is well conditioned, results
in strong fluctuations in $m(k)$ and does not seem to yield extensive capacity. The presence of
noise also regularizes the system but again does not give extensive capacity, as indicated in
Fig. 4. We are currently developing a fuller understanding of the short-term memory properties
of generic connectivity matrices.

FIG. 4: Numerical calculation of the memory function of Gaussian random matrices for
$\epsilon = 0.01$, $\alpha = 0.999$ and different sizes. Results are averages over 50
realizations of the connectivity matrix. Exploring different values of $\alpha$ we find that up
to the above mentioned value the memory improves with increasing $\alpha$. Increasing $\alpha$
beyond this value results in irregular and highly variable $m(k)$.